Purpose

Tired of hearing about the latest Facebook or LinkedIn data leak? Get in on scraping (non-nefarious) data yourself!

In this post I want to show how easy it is to scrape data from the internet. It is the first post in a series looking at ships. I walk through scraping data from Wikipedia, one of the best places on the internet to ingest tabular data from.

Before we begin with the walkthrough, I want to show what the output looks like, and show how cargo ships have become larger over time.

How have cargo ships increased in size?

The plot below shows the evolution of ship size from 1870 to today. On the x-axis is the length of the average cargo ship per decade from Wikipedia’s list of cargo ships. On the y-axis is the average ship beam, or width at the widest point. The colour of the rectangle shows the deadweight tonnage of the average ship, or the amount of cargo that the ship can carry.

Cargo ships have increased dramatically in size over time! The oldest ship in our dataset is the R. J. Hackett, one of the first Great Lakes freighters. It was just 63m long and 10m wide, with a wooden hull. According to historian Mark Thompson, the R. J. Hackett’s boxy hull, hatch-lined deck, and placement of the deckhouses meant the ship was ideally suited for moving cargo through inland waterways. This steamer greatly influenced the development of cargo ships which followed.

Today, container ships like the Ever Given are nearly 400m long, 60m wide, and can carry more than 20,000 TEUs. That’s enough space for 745 million bananas!

Increasing size of container ships by tonnage, length, and width

In the plots below we focus only on container ships built after 1970. This era saw the construction of the first ships purpose-built to carry ISO containers, which could be loaded and unloaded rapidly at port, then repacked and shipped onward on any compatible container ship. The ISO standard container transformed the shipping process and replaced the break bulk carriers that were the prior status quo.

Tonnage over time

How have cargo ship deadweight tonnages, or how much cargo a ship can carry, changed over time? Mouse over a point to see the name of the ship.

Container ships can carry more cargo today than ever before. It’s hard to get my mind around 220 000 tons of cargo!

Width over time

How have cargo ship beams, or widths of ships at their widest point, changed over time? Mouse over a point to see the name of the ship.

Container ships have also become wider, with beams clustering at 32m, 40m and 59m.

Length over time

How has the length of cargo ships changed over time? Mouse over a point to see the name of the ship.

The Ever Given is among the longest container ships operating today at 400m in length. The linear fit line shows that there has been a steady increase in container ship length over time.

Bifurcation in cargo ship size

So it certainly seems that cargo ships have been becoming larger over time. Interestingly, it appears that while the largest container ships continue to get larger, there is still a need for relatively small ships. There are a significant number of container ships that can carry less than 50 000 tons launched since 2010, shown in the density plot below. We could say that there has been a bifurcation in ship size, with a few enormous ships, and a greater number of smaller ships operating in tandem today.

Scraping Wikipedia

Now that we have had a look at the data, I want to walk through how easily it can be collected and processed for visualization.

The source of the historic data was Wikipedia’s list of cargo ships, a screenshot of which I include below.

The list contains the names of the cargo ships in alphabetical order. We want to grab the links to each article from the list. We can use the SelectorGadget tool to find the CSS selector that leads to each ship’s page. SelectorGadget is an open source tool that makes CSS selector generation and discovery on complicated sites a breeze. It allows you to point to a web page element with the mouse and find the CSS selector for that element. It highlights everything matched by the selector.

I show here a picture of the interface with Google’s Chrome browser:

Once we have the path we want to collect the links from, we can use the rvest package to scrape the data. Written by Hadley Wickham, this is a package that makes it easy to scrape data from HTML web pages.

We start with the url of the list:
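The post’s original code chunk for this step isn’t shown, so here is a minimal version. The exact page title is an assumption on my part:

```r
library(rvest)
library(tidyverse)

# URL of Wikipedia's list of cargo ships (assumed page title)
url <- "https://en.wikipedia.org/wiki/List_of_cargo_ships"
```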

Function to grab the ship page URLs from the list of cargo ships.

Next we write a function that gets the list of links from the page. We begin by reading the HTML from the link, then selecting the nodes with the links, and select the attribute called “href” – the url of the page for each ship. We format the output as a tibble, a data frame object that is convenient to work with. Notably this function will work for any list of pages on Wikipedia, neat!
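The function described above isn’t reproduced in the post, so below is a sketch of what it might look like. The name `get_links` is hypothetical, and the bare `"a"` selector is a placeholder for the more specific CSS selector SelectorGadget would return:

```r
library(rvest)
library(tidyverse)

# Hypothetical reconstruction -- the real CSS selector from SelectorGadget
# will be more specific than the bare "a" used here.
get_links <- function(page_url) {
  read_html(page_url) %>%   # read the HTML from the link
    html_nodes("a") %>%     # select the nodes containing the links
    html_attr("href") %>%   # grab the "href" attribute: each page's URL
    tibble(value = .)       # format the output as a tibble
}
```

Because `read_html()` only cares about getting valid HTML, this works for any Wikipedia list page.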

Here we apply the function to our link and get back a list of links to each page.

## # A tibble: 1,024 x 1
##    value
##    <chr>
##  1 #top 
##  2 #0–9 
##  3 #A   
##  4 #B   
##  5 #C   
##  6 #D   
##  7 #E   
##  8 #F   
##  9 #G   
## 10 #H   
## # ... with 1,014 more rows

We can see we get 1 024 links back, but the problem is that there are multiple instances of the Table of Contents links, “#A”, “#B” etc. We will filter these out by only selecting links with 5 or more characters.
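The filtering step could be sketched as below. The `links` input stands in for the tibble returned above, and prepending the domain to each relative href is my assumption (it would also explain the double slash visible in the URLs that follow):

```r
library(tidyverse)

# `links` stands in for the tibble of hrefs scraped above;
# a tiny example for illustration:
links <- tibble(value = c("#top", "#A", "/wiki/Algosteel", "/wiki/MV_Buffalo"))

list_of_links <- links %>%
  # drop short Table of Contents anchors like "#A"
  filter(str_length(value) >= 5) %>%
  # turn relative hrefs into full URLs
  mutate(url = str_c("https://en.wikipedia.org/", value)) %>%
  select(url)
```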

## # A tibble: 268 x 1
##    url                                                                          
##    <chr>                                                                        
##  1 https://en.wikipedia.org//w/index.php?title=Al_Rekayyat,_LNG_Tanker&action=e~
##  2 https://en.wikipedia.org//wiki/Algosteel                                     
##  3 https://en.wikipedia.org//wiki/SS_Amasa_Stone                                
##  4 https://en.wikipedia.org//wiki/MV_Algocape                                   
##  5 https://en.wikipedia.org//wiki/Algorail                                      
##  6 https://en.wikipedia.org//wiki/Algosoo_(1974_ship)                           
##  7 https://en.wikipedia.org//wiki/Algolake                                      
##  8 https://en.wikipedia.org//wiki/Algoma_Equinox                                
##  9 https://en.wikipedia.org//wiki/MV_Buffalo                                    
## 10 https://en.wikipedia.org//wiki/Algoma_Compass                                
## # ... with 258 more rows

Now that we have the list of links we can get the data about each ship from its page with a similar function. We want each ship’s date of launch, type, tonnage, length, beam and status. This is all helpfully stored in the infobox on each page:

SelectorGadget helps us out again, returning a path to the infobox:

Scraping each page
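The body of `get_ship_info_wiki_list` isn’t shown in the post; a minimal sketch, assuming the infobox is reached via the `.infobox` class that SelectorGadget points to, could look like this. Each infobox row holds the field name in a `<th>` and its value in a `<td>`:

```r
library(rvest)
library(tidyverse)

# Sketch only: the real selector comes from SelectorGadget.
get_ship_info_wiki_list <- function(page_url) {
  rows <- read_html(page_url) %>%
    html_nodes(".infobox tr")            # one node per infobox row
  tibble(
    # html_node() returns a missing node (and html_text() returns NA)
    # when a row lacks a <th> or <td>
    title = map_chr(rows, ~ html_text(html_node(.x, "th"))),
    value = map_chr(rows, ~ html_text(html_node(.x, "td")))
  )
}
```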

# mapping through each url
df <- list_of_links %>%
        # possibly() records "failed" instead of throwing an error,
        # for example when an article has no infobox
        mutate(text = map(url, possibly(get_ship_info_wiki_list, "failed")))

First we remove the ships that failed; for these, possibly() stored the string “failed” in the text column instead of a tibble.

df <- df %>% 
  # keep only the pages where scraping succeeded
  filter(map_lgl(text, is.data.frame)) %>% 
  unnest(text)

Next we select the information from the infobox that we want to keep.

df_wide <- df %>% 
  mutate(value = str_squish(value)) %>% 
  group_by(url) %>% 
  mutate(row_id = row_number()) %>%
  # keep only the infobox fields we need
  filter(title %in% c("Name:",
           "Launched:",
           "Status:",
           "Tonnage:",
           "Length:",
           "Beam:")) %>%
  # then we pivot wider so that each ship is one row
  pivot_wider(names_from = title, values_from = value) %>% 
  # cleaning the names makes it easier to use these columns later
  janitor::clean_names() %>% 
  ungroup() %>% 
  select(-row_id) %>% 
  group_by(url) %>% 
  # collapse duplicate rows per ship, keeping the first non-missing value
  summarise_all(~ na.omit(.)[1])

write_rds(df_wide, "data/cargo_ship_info.rds")

Here is the output: